Analysing Dengue Cases in Singapore

Problem Statement

In our Capstone Project, we aim to explore the relationship between meteorological conditions (air temperature and rainfall) of Singapore and clinical conditions commonly seen in local community – Dengue fever (DF), hand food mouth disease (HFMD), upper respiratory tract infection (URTI) and diarrhea.

Specifically, we aim to answer the following questions: Does higher air temperature and rainfall associated with the rise in clinical conditions in the community? And which region of Singapore has the highest number of DF cases?

Activate packages

# Rule for activating packages in development mode:
# - Specify everything fully
# - Exceptions:
#   - ggfortify - to overload `ggplot2::autoplot()`
#   - ggmap - has to be activated for `ggmap::register_google()` to work
#   - ggplot2 - too troublesome otherwise
#   - magrittr - `%>%`
#
# For deployment mode, activate every package that is referenced more than once
# - Add package name here, then use find-and-replace to remove "<package>::"
# - Allowed to activate tidyverse
#   - Remember what are the 8 packages?

pacman::p_load(
  ggfortify,
  # ggmap,
  ggplot2,
  magrittr
  # tidyverse
)

Import

Most time-consuming, finding data, is.

All data are available online and first gathered into separate distinct raw datasets:

Click here for a summary page. The full glorious details of their providence are given below.

Helper functions

The following functions must be activated no matter which method of data import is chosen.

import_moh_weekly <- function(url_or_path = NULL) {
  #' Weekly Infectious Diseases Bulletin
  #' 
  #' @description
  #' \href{https://www.moh.gov.sg/resources-statistics/infectious-disease-
  #' statistics/2020/weekly-infectious-diseases-bulletin}{MOH Weekly Infectious 
  #' Disease Bulletin} from the Ministry of Health (MOH). 
  #' 
  #' \href{https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/
  #' weekly-infectious-disease-bulletin-year-2020
  #' d1092fcb484447bc96ef1722b16b0c08.xlsx}{Latest data as of 31 July 2020 
  #' (2012-W01 to 2020-W30)}.
  #' 
  #' @param url_or_path The URL or file path of the .xlsx file.
  #' @return Weekly infectious diseases bulletin (2012-W01 to 2020-W30).
  
  # If no URL or path is specified, try to get the file directly from MOH.
  if (is.null(url_or_path)) {
    url_or_path = paste0(
      "https://www.moh.gov.sg/docs/librariesprovider5/diseases-updates/",
      "weekly-infectious-disease-bulletin-year-2020",
      "d1092fcb484447bc96ef1722b16b0c08.xlsx"
    )
  }
  
  # Check if the given path is a URL by trying to download to a temp file. If 
  #   successful, return the temp file. If not, return the original path.
  xlsx_file = tryCatch({
    temp = tempfile(fileext = ".xlsx")
    download.file(url_or_path, destfile = temp, mode = "wb")
    temp
  }, error = function(e) {
    url_or_path
  })
  
  # Columns will be renamed to follow 2020
  colnames_2020 = c(
    "Campylobacter enterosis" = "Campylobacter enteritis",
    "Campylobacterenterosis" = "Campylobacter enteritis",
    "Campylobacteriosis" = "Campylobacter enteritis",
    "Chikungunya Fever" = "Chikungunya",
    "Dengue Haemorrhagic Fever" = "DHF",
    "Dengue Fever" = "Dengue",
    "Hand, Foot and Mouth Disease" = "HFMD",
    "Hand, Foot Mouth Disease" = "HFMD",
    "Nipah virus infection" = "Nipah",
    "Viral Hepatitis A" = "Acute Viral Hepatitis A",
    "Viral Hepatitis E" = "Acute Viral Hepatitis E",
    "Zika Virus Infection" = "Zika",
    "Zika virus infection" = "Zika"
  )
  
  xlsx_file %>%
    readxl::excel_sheets() %>% 
    lapply(function(sheetname) {
      df = readxl::read_xlsx(xlsx_file, sheetname, skip = 1)
      
      # Date formats are different for 2020 (dmy instead of mdy)
      if (sheetname == "2020") {
        df$Start = lubridate::dmy(df$Start)
        df$End = lubridate::dmy(df$End)
      }
      
      # Find and rename columns that need to be renamed, and rename them
      mapper = na.omit(colnames_2020[names(df)])
      dplyr::rename_with(df, ~mapper, names(mapper))
    }) %>% 
    dplyr::bind_rows() %>% 
    dplyr::rename(Epiweek = `Epidemiology Wk`) %>% 
    dplyr::mutate(Epiyear = lubridate::epiyear(Start)) %>% 
    dplyr::select(Epiyear, everything()) %>% 
    dplyr::arrange(Start)
}

read_kmls <- function(url_or_path) {
  #' Read Single or Multiple KML Files
  #' 
  #' @description
  #' There are (at least) 2 approaches to handling .kml data:
  #' \enumerate{
  #'   \item sp - rgdal::readOGR()
  #'   \item sf - sf::st_read()
  #' }
  #'
  #' The sp approach was abandoned as their objects were rather complex and 
  #'   did not facilitate method chaining.
  #'
  #' The sf approach produced objects that look like data.frames, which had 
  #'   better support for method chaining, but also some peculiarities:
  #' \itemize{
  #'   \item Dimension: set to XY using sf::st_zm()
  #'   \item Geographic CRS: use sf::`st_crs<-`("WGS84") World Geodetic 
  #'         Survey 1984
  #' }
  #' 
  #' @param url_or_path The URL(s) or file path(s) of the .kml file(s).
  #' @return A single combined sf object.
  
  # Check if the given paths are URLs by trying to download to temp files. If 
  #   successful, return the temp files. If not, return the original paths. 
  #   Automatically extract .zip files, if any.
  kml_files = tryCatch({
    temp = tempfile(fileext = paste0(".", tools::file_ext(url_or_path)))
    Map(function(x, y) download.file(x, y, mode = "wb"), url_or_path, temp)
    sapply(temp, function(x) {
      if (endsWith(x, ".zip")) {
        unzip(x)
      } else {
        x
      }
    })
  }, error = function(e) {
    url_or_path
  })
  
  kml_files %>% 
    lapply(sf::st_read, quiet = T) %>% 
    dplyr::bind_rows() %>% 
    tibble::as_tibble() %>% 
    sf::st_as_sf() %>% 
    sf::st_zm()
}

Import data de novo (where available)

The following code chunk has been disabled, but is provided here to show the details behind the webscraping processes.

import_mss_daily <- function(years, stations = NULL) {
  #' Historical Daily Weather Records
  #' 
  #' @description
  #' \href{http://www.weather.gov.sg/climate-historical-daily/}{Daily weather 
  #' records} from the Meteorological Service Singapore (MSS). 
  #' Available data ranges from January 1980 to June 2020. Only 19 of the 63 
  #' climate stations are recognized by this function, because these contain 
  #' more than just rainfall data. \href{http://www.weather.gov.sg/wp-content/
  #' uploads/2016/12/Station_Records.pdf}{List of stations, weather parameters 
  #' and periods of records}.
  #' 
  #' MSS is nice enough to have their data in .csv files, with a systematic 
  #' naming scheme. Data compilation becomes as simple as generating the 
  #' appropriate list of URLs.
  #' 
  #' @param years A vector of the years of interest.
  #' @param stations A vector of the climate station names.
  #' @return Combined daily weather records.
  #' @examples
  #' import_mss_daily(2012:2020, "Changi")
  #' import_mss_daily(2012:2020, c("Changi", "Clementi", "Khatib", "Newton"))
  
  stations_lookup = c(
    "Admiralty" = "104_",
    "Ang Mo Kio" = "109_",
    "Changi" = "24_",
    "Choa Chu Kang (South)" = "121_",
    "Clementi" = "50_",
    "East Coast Parkway" = "107_",
    "Jurong (West)" = "44_",
    "Jurong Island" = "117_",
    "Khatib" = "122_",
    "Marina Barrage" = "108_",
    "Newton" = "111_",
    "Pasir Panjang" = "116_",
    "Pulau Ubin" = "106_",
    "Seletar" = "25_",
    "Sembawang" = "80_",
    "Sentosa Island" = "60_",
    "Tai Seng" = "43_",
    "Tengah" = "23_",
    "Tuas South" = "115_"
  )
  
  # Check that all provided station names are in the list, if not, exit and 
  #   print out the list (of names) for the user.
  mask = !(stations %in% names(stations_lookup))
  if (any(mask)) {
    stop("The following station names are not recognized:\n",
         paste(stations[mask], collapse = "\n"),
         "\n\nPlease select from the following:\n",
         paste(names(stations_lookup), collapse = "\n"))
  }
  
  # If no station names specified, take the full list
  if (is.null(stations)) {
    station_nums = stations_lookup
  } else {
    station_nums = stations_lookup[stations]
  }
  
  # The URL to each .csv file has the following format:
  # - "<url_base><station number>_<YYYY><MM>.csv"
  # - Notice the station numbers given above contain the "_" suffix
  url_base = "http://www.weather.gov.sg/files/dailydata/DAILYDATA_S"
  
  # To enumerate all the URLs, take the Cartesian product of:
  # - station numbers
  # - years
  # - months (we'll always attempt to find all 12 months)
  #
  # Flatten the result into a vector, then prefix and suffix with `url_base` 
  #   and ".csv", respectively.
  
  # Base R
  urls = station_nums %>% 
    outer(years, FUN = paste0) %>% 
    outer(sprintf("%02d", 1:12), FUN = paste0) %>% 
    sort() %>%  # Flatten and arrange in alphabetical order
    paste0(url_base, ., ".csv")
  
  # Equivalent tidyverse approach (for reference)
  # urls = station_nums %>% 
  #   tidyr::crossing(years, sprintf("%02d", 1:12)) %>% 
  #   tidyr::unite("station_year_month", sep = "") %>% 
  #   dplyr::pull() %>% 
  #   paste0(url_base, ., ".csv")
  
  # We have a list of URLs, but some of them may not exist (some stations do 
  #   not have data for some months). Use tryCatch() to return a table for the 
  #   URLs that exist, and a NA for those that don't.
  #
  # Problems with multibyte characters and countermeasures:
  # - Some colnames contained the "degree sign" symbol which somehow made the 
  #   table contents inaccessible. Tables after April 2020 had slightly 
  #   different colnames.
  #   - Manually set the colnames for each table.
  # - Some tables contained the "em dash" symbol to indicate missing values.
  #   - They will be coerced to NA when the columns are type cast to numeric.
  #   - There will be warning messages.
  dfs = urls %>% 
    lapply(function(url) {
      tryCatch({
        df = readr::read_csv(url,
                             skip = 1,
                             col_names = c(
                               "Station",
                               "Year",
                               "Month",
                               "Day",
                               "Rainfall",
                               "Highest_30_min_rainfall",
                               "Highest_60_min_rainfall",
                               "Highest_120_min_rainfall",
                               "Mean_temp",
                               "Max_temp",
                               "Min_temp",
                               "Mean_wind",
                               "Max_wind"
                             ),
                             col_types = readr::cols_only(
                               "Station" = "c",
                               "Year" = "n",
                               "Month" = "n",
                               "Day" = "n",
                               "Rainfall" = "n",
                               "Mean_temp" = "n",
                               "Max_temp" = "n",
                               "Min_temp" = "n"
                             ))
        
        # Announce progress (UX is important! We can tolerate lower efficiency)
        message(paste(df[1, 1:3], collapse = "_"))
        
        df
      }, error = function(e) {
        NA
      })
    })
  
  dfs[!is.na(dfs)] %>% 
    dplyr::bind_rows() %>% 
    # Calculate daily temperature range
    dplyr::mutate(Temp_range = Max_temp - Min_temp,
                  .keep = "unused") %>% 
    # Calculate epidemiological years and weeks
    dplyr::mutate(Date = lubridate::ymd(paste(Year, Month, Day, sep = "-")),
                  Epiyear = lubridate::epiyear(Date),
                  Epiweek = lubridate::epiweek(Date),
                  .keep = "unused",
                  .after = Station) %>% 
    dplyr::arrange(Station, Date)
}

import_hcidirectory <- function() {
  #' Healthcare Institutions Directory
  #' 
  #' @description
  #' The \href{http://hcidirectory.sg/hcidirectory/}{Healthcare Institutions 
  #' (HCI) Directory}, an initiative by the Ministry of Health (MOH), is a 
  #' platform for all HCIs licensed under the Private Hospitals and Medical 
  #' Clinics (PHMC) Act to provide information about their services and 
  #' operations to the public.
  #' 
  #' This function is custom-made to consolidate the names and addresses of 
  #' HCIs which are medical clinics that offer general medical services.
  #' 
  #' The HCI Directory is a dynamic web page, so using RSelenium might be 
  #' required.
  #' 
  #' @return The names and addresses of selected HCIs.
  
  # Run a Selenium Server using `RSelenium::rsDriver()`. The parameters e.g. 
  #   `browser`, `chromever` (or `geckover` if using Firefox, or other drivers 
  #   if using other browsers) have to be properly set. Trial-and-error until a 
  #   configuration works. Set `check = T` the very first time it's run on a 
  #   system, then set `check = F` after that to speed things up.
  rD = RSelenium::rsDriver(browser = "chrome",
                           chromever = "83.0.4103.39",
                           check = F)
  
  # Connect to server with a remoteDriver instance.
  remDr = rD$client
  
  # Set timeout on waiting for elements
  remDr$setTimeout(type = "implicit", milliseconds = 10000)
  
  # Navigate to the given URL
  remDr$navigate("http://hcidirectory.sg/hcidirectory/")
  
  # Click 4 things:
  # 1. "MORE SEARCH OPTIONS"
  # 2. "Medical Clinics Only"
  # 3. "General Medical"
  # 4. "Search"
  c(
    "options" = "#moreSearchOptions",
    "medclins" = "#criteria > table > tbody > tr:nth-child(2) > td > label",
    "genmed" = "#isGenMed",
    "search" = "#search_btn_left"
  ) %>% 
    lapply(remDr$findElement, using = "css") %>% 
    purrr::walk(function(elem) elem$clickElement())
  
  # Find the number of pages
  results = remDr$findElement("#results", using = "css")
  npages = results$getElementAttribute("innerHTML")[[1]] %>% 
    xml2::read_html() %>% 
    rvest::html_node("#totalPage") %>% 
    rvest::html_attr("value") %>% 
    as.numeric()
  
  # Create an empty tibble to append results
  df = tibble::tibble(
    id = character(),
    name = character(),
    add = character()
  )
  
  i = 1
  while (T) {
    results = remDr$findElement("#results", using = "css")
    html = results$getElementAttribute("innerHTML")[[1]] %>% 
      xml2::read_html()
    
    # Determine the index numbers of the (up to 10) results on the page
    idx = html %>% 
      # Find the element that says "SHOWING 1 - 10 OF 1,761 RESULTS"
      rvest::html_nodes(".col1") %>% 
      .[1] %>% 
      rvest::html_text() %>% 
      # Commas have to be eliminated for numbers > 999
      gsub(",", "", .) %>% 
      # Find the smallest and largest numbers and apply the colon operator
      sub(".*Showing\\s+(.*)\\s+of.*", "\\1", .) %>% 
      strsplit(split = " - ") %>% 
      unlist() %>% 
      as.numeric() %>% 
      { .[1]:.[2] }
    
    # Only append results if IDs are not in the table
    if (!any(idx %in% df$id)) {
      df = df %>% 
        dplyr::bind_rows(
          html %>%
            # Find both the name and address nodes
            rvest::html_nodes(".name,.add") %>% 
            rvest::html_text() %>% 
            # Tidy whitespace
            gsub("\\s+", " ", .) %>% 
            trimws() %>% 
            # Concatenate IDs, odd rows (names), and even rows (addresses)
            { cbind(idx, .[c(TRUE,FALSE)], .[!c(TRUE,FALSE)]) } %>% 
            tibble::as_tibble() %>% 
            setNames(c("id", "name", "add"))
        )
      
      # Announce progress and increment page counter
      message(i, " of ", npages, " done (", round(i / npages * 100, 2), "%)")
      i = i + 1
    }
    
    # Natural exit point
    if (i > npages) break
    
    # Navigate to the next page (if available, else stop)
    the_end = tryCatch({
      nextpage = remDr$findElement("#PageControl > div.r_arrow", using = "css")
      nextpage$clickElement()
      F
    }, error = function(e) {
      print(paste("There are no more pages after", i))
      T
    })
    
    # Unnatural exit point
    if (the_end) break
  }
  
  # Clean up RSelenium
  remDr$close()
  rD[["server"]]$stop()
  rm(rD, remDr)
  gc()
  
  # Kill Java instance(s) inside RStudio
  # docs.microsoft.com/en-us/windows-server/administration/windows-commands/taskkill
  system("taskkill /im java.exe /f", intern = F, ignore.stdout = F)
  
  # Clean up:
  # - Franchises may have the same name with different addresses
  # - Different practices may have the same zipcodes and even buildings
  # - We will consider each full address unique, and a single practice
  
  # Clean up duplicate addresses
  df %>% 
    .[!duplicated(tolower(.$add), fromLast = T),]
}

zipcodes_to_geocodes <- function(zipcodes) {
  #' Get Geo-location from Google Maps
  #' 
  #' @description
  #' Attempt to obtain the longitudes, latitudes, and addresses of the given 
  #' zipcodes using ggmap::geocode().
  #' 
  #' @param zipcodes A vector of zipcodes.
  #' @return Geo-location data of the associated zipcodes.
  
  # Prompt user to input API key
  ggmap::register_google(key = readline("Please enter Google API key: "))
  
  # Create an (almost) empty tibble to append results
  res = zipcodes %>% 
    # Remove duplicates to minimize number of requests
    .[!duplicated(.)] %>% 
    tibble::as_tibble() %>% 
    dplyr::rename(zip = value) %>% 
    dplyr::mutate(lon = NA_real_,
                  lat = NA_real_,
                  address = NA_character_)
  
  for (i in 1:nrow(res)) {
    result = tryCatch({
      ggmap::geocode(res$zip[i], output = "latlona", source = "google")
    }, warning = function(w) {
      w$message
    }, error = function(e) {
      NA
    })
    
    # If the registered key is invalid, there's no point continuing
    if (grepl("The provided API key is invalid", result[1], fixed = T)) {
      stop("A valid Google API key is required.")
    }
    
    # A useful result will have something, and will have names
    if (!is.na(result) && !is.null(names(result))) {
      res$lon[i] = result$lon
      res$lat[i] = result$lat
      res$address[i] = result$address
    }
    
    # Announce progress
    message(i, " of ", nrow(res), " (",round(i / nrow(res) * 100, 2), "%)")
  }
  
  res
}

The following code chunk has been disabled, but is provided here to show how the data may be obtained from primary sources (with one exception). Using this code chunk would require activating the previous code chunk, as well as a valid Google Maps Platform API key.

grand_import_de_novo <- function() {
  # This might take a while, and requires a Google Maps Platform API key
  
  list(
    "moh_bulletin" = import_moh_weekly(),
    
    "mss_19stations" = import_mss_daily(years = 2012:2020),
    
    "hci_clinics" = import_hcidirectory() %>% 
      dplyr::mutate(zip = sub(".*((?i)singapore\\s+\\d+).*", "\\1", add)) %>% 
      # Requires Google Maps Platform API key
      dplyr::left_join(zipcodes_to_geocodes(.$zip), by = "zip") %>% 
      dplyr::select(-id, -zip, -address) %>% 
      tidyr::drop_na(),
    
    "planning_areas" = read_kmls(paste0(
      "https://geo.data.gov.sg/plan-bdy-dwelling-type-2017/2017/09/27/kml/",
      "plan-bdy-dwelling-type-2017.kml"
    )),
    
    "regions" = read_kmls(paste0(
      "https://geo.data.gov.sg/mp14-region-web-pl/2014/12/05/kml/",
      "mp14-region-web-pl.zip"
    )),
    
    "dengue_polys" = read_kmls(
      c("central", "northeast", "southeast", "southwest") %>% 
        paste0("https://geo.data.gov.sg/denguecase-", .,
               "-area/2020/07/17/kml/denguecase-", ., "-area.kml")
    ),
    
    "aedes_polys" = read_kmls(c(
      c("central", "northeast", "northwest") %>% 
        paste0("https://geo.data.gov.sg/breedinghabitat-", .,
               "-area/2020/07/17/kml/breedinghabitat-", ., "-area.kml"),
      c("southeast", "southwest") %>% 
        paste0("https://geo.data.gov.sg/breedinghabitat-", .,
               "-area/2020/07/23/kml/breedinghabitat-", ., "-area.kml")
    )),
    
    "mss_63station_pos" = readr::read_csv(paste0(
      "https://raw.githubusercontent.com/roscoelai/dasr2020capstone/master/",
        "data/mss/Station_Records.csv"
    ))
  )
}

# raw_data <- grand_import_de_novo()

Import data from saved files

If the data files are available locally, set from_online_repo = F for faster loading. Otherwise, the file will be obtained from an online repository. The .csv files will be read directly while the other file formats will be downloaded to a temporary folder and read locally.

grand_import_no_webscraping <- function(from_online_repo = TRUE) {
  # Allow user to choose whether to import raw data from an online repository 
  #   or from local files.
  
  if (from_online_repo) {
    fld = paste0("https://raw.githubusercontent.com/roscoelai/",
                 "dasr2020capstone/master/data/")
  } else {
    # Check that the "../data/ folder exists
    assertthat::assert_that(dir.exists("../data/"),
                            msg = 'Unable to locate "../data/" directory.')
    fld = "../data/"
  }
  
  list(
    "moh_bulletin" = import_moh_weekly(paste0(
      fld, "moh/weekly-infectious-disease-bulletin-year-2020.xlsx"
    )),
    
    "mss_19stations" = readr::read_csv(paste0(
      fld, "mss/mss_daily_2012_2020_19stations_20200728.csv"
    )),
    
    "hci_clinics" = readr::read_csv(paste0(
      fld, "hcid/hci_clinics_20200725.csv"
    )),
    
    "planning_areas" = read_kmls(paste0(
      fld, "kmls/plan-bdy-dwelling-type-2017.kml"
    )),
    
    "regions" = read_kmls(paste0(
      fld, "kmls/MP14_REGION_WEB_PL.kml"
    )),
    
    "dengue_polys" = read_kmls(paste0(
      fld, "kmls/denguecase-", c("central",
                                 "northeast",
                                 "southeast",
                                 "southwest"), "-area.kml"
    )),
    
    "aedes_polys" = read_kmls(paste0(
      fld, "kmls/breedinghabitat-", c("central",
                                      "northeast",
                                      "northwest",
                                      "southeast",
                                      "southwest"), "-area.kml"
    )),
    
    "mss_63station_pos" = readr::read_csv(paste0(
      fld, "mss/Station_Records.csv"
    ))
  )
}

raw_data <- grand_import_no_webscraping(from_online_repo = F)

What raw data do we have?

raw_data %>% 
  tibble::tibble(
    names = names(.),
    data = .,
    nrow = sapply(., nrow),
    ncol = sapply(., ncol)
  )
# A tibble: 8 x 4
  names             data                   nrow  ncol
  <chr>             <named list>          <int> <int>
1 moh_bulletin      <tibble [470 x 46]>     470    46
2 mss_19stations    <tibble [58,489 x 7]> 58489     7
3 hci_clinics       <tibble [1,739 x 4]>   1739     4
4 planning_areas    <tibble [55 x 3]>        55     3
5 regions           <tibble [5 x 3]>          5     3
6 dengue_polys      <tibble [1,203 x 3]>   1203     3
7 aedes_polys       <tibble [315 x 3]>      315     3
8 mss_63station_pos <tibble [63 x 9]>        63     9
raw_data
$moh_bulletin
# A tibble: 470 x 46
   Epiyear Epiweek Start               End                 Cholera Paratyphoid
     <dbl>   <dbl> <dttm>              <dttm>                <dbl>       <dbl>
 1    2012       1 2012-01-01 00:00:00 2012-01-07 00:00:00       0           1
 2    2012       2 2012-01-08 00:00:00 2012-01-14 00:00:00       0           1
 3    2012       3 2012-01-15 00:00:00 2012-01-21 00:00:00       0           1
 4    2012       4 2012-01-22 00:00:00 2012-01-28 00:00:00       0           1
 5    2012       5 2012-01-29 00:00:00 2012-02-04 00:00:00       0           3
 6    2012       6 2012-02-05 00:00:00 2012-02-11 00:00:00       0           2
 7    2012       7 2012-02-12 00:00:00 2012-02-18 00:00:00       0           1
 8    2012       8 2012-02-19 00:00:00 2012-02-25 00:00:00       0           4
 9    2012       9 2012-02-26 00:00:00 2012-03-03 00:00:00       0           4
10    2012      10 2012-03-04 00:00:00 2012-03-10 00:00:00       0           2
# ... with 460 more rows, and 40 more variables: Typhoid <dbl>, `Acute Viral
#   Hepatitis A` <dbl>, `Acute Viral Hepatitis E` <dbl>, Poliomyelitis <dbl>,
#   Plague <dbl>, `Yellow Fever` <dbl>, Dengue <dbl>, DHF <dbl>, Malaria <dbl>,
#   Chikungunya <dbl>, HFMD <dbl>, Diphtheria <dbl>, Measles <dbl>,
#   Mumps <dbl>, Rubella <dbl>, SARS <dbl>, Nipah <dbl>, `Acute Viral hepatitis
#   B` <dbl>, Encephalitis <dbl>, Legionellosis <dbl>, `Campylobacter
#   enteritis` <dbl>, `Acute Viral hepatitis C` <dbl>, Leptospirosis <dbl>,
#   Melioidosis <dbl>, `Meningococcal Infection` <dbl>, Pertussis <dbl>,
#   `Pneumococcal Disease (invasive)` <dbl>, `Haemophilus influenzae type
#   b` <dbl>, `Salmonellosis(non-enteric fevers)` <dbl>, `Avian
#   Influenza` <dbl>, Zika <dbl>, `Ebola Virus Disease` <dbl>, `Japanese
#   Encephalitis` <dbl>, Tetanus <dbl>, Botulism <dbl>, `Murine Typhus` <dbl>,
#   `Acute Upper Respiratory Tract infections` <dbl>, `Acute
#   Conjunctivitis` <dbl>, `Acute Diarrhoea` <dbl>, Chickenpox <dbl>

$mss_19stations
# A tibble: 58,489 x 7
   Station   Date       Epiyear Epiweek Rainfall Mean_temp Temp_range
   <chr>     <date>       <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
 1 Admiralty 2012-01-01    2012       1      0        27.3       7.40
 2 Admiralty 2012-01-02    2012       1      0        27.1       5.8 
 3 Admiralty 2012-01-03    2012       1      0        26.9       4.60
 4 Admiralty 2012-01-04    2012       1      0        26.8       7   
 5 Admiralty 2012-01-05    2012       1      0        26.4       6.5 
 6 Admiralty 2012-01-06    2012       1      0        27.1       6.40
 7 Admiralty 2012-01-07    2012       1      1.4      26.5       6.20
 8 Admiralty 2012-01-08    2012       2      3.4      24.8       2.40
 9 Admiralty 2012-01-09    2012       2      2.6      25.4       3.4 
10 Admiralty 2012-01-10    2012       2      8.6      24.7       2.80
# ... with 58,479 more rows

$hci_clinics
# A tibble: 1,739 x 4
   name                   add                                          lon   lat
   <chr>                  <chr>                                      <dbl> <dbl>
 1 DA CLINIC @ BISHAN     BLK 501 BISHAN STREET 11 #01-374 Singapor~  104.  1.35
 2 SINGHEALTH POLYCLINIC~ 1 TAMPINES STREET 41 TAMPINES POLYCLINIC ~  104.  1.36
 3 1 MEDICAL TECK GHEE    BLK 410 ANG MO KIO AVENUE 10 TECK GHEE SQ~  104.  1.36
 4 115 EASTPOINT CLINIC ~ BLK 115 BEDOK NORTH RD #01-301 Singapore ~  104.  1.33
 5 18 CLINIC              BLK 101 TOWNER ROAD #01-228 Singapore 322~  104.  1.32
 6 1AESTHETICS, MEDICAL ~ 8 EU TONG SEN STREET THE CENTRAL #14-90 S~  104.  1.29
 7 326 AVENUE 3 CLINIC    BLK 326 SERANGOON AVE 3 #01-382 Singapore~  104.  1.35
 8 338 FAMILY CLINIC      BLK 338 ANG MO KIO AVE 1 #01-1615 Singapo~  104.  1.36
 9 57 MEDICAL CLINIC (GE~ BLK 57 GEYLANG BAHRU #01-3505 Singapore 3~  104.  1.32
10 8 MEDICAL AESTHETIC C~ 51 CUPPAGE ROAD #01-03 Singapore 229469     104.  1.30
# ... with 1,729 more rows

$planning_areas
Simple feature collection with 55 features and 2 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 55 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                               <MULTIPOLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ (((103.8572 1.396537, 103.8574 1.39629~
 2 kml_2  "<center><table><tr><th colsp~ (((103.9319 1.343094, 103.9355 1.33956~
 3 kml_3  "<center><table><tr><th colsp~ (((103.8492 1.362753, 103.8494 1.36268~
 4 kml_4  "<center><table><tr><th colsp~ (((103.6973 1.307544, 103.6973 1.30755~
 5 kml_5  "<center><table><tr><th colsp~ (((103.7641 1.370011, 103.7644 1.36947~
 6 kml_6  "<center><table><tr><th colsp~ (((103.8172 1.294353, 103.8174 1.29433~
 7 kml_7  "<center><table><tr><th colsp~ (((103.7745 1.390289, 103.775 1.386071~
 8 kml_8  "<center><table><tr><th colsp~ (((103.7977 1.348128, 103.7981 1.34778~
 9 kml_9  "<center><table><tr><th colsp~ (((103.9018 1.309745, 103.9015 1.30954~
10 kml_10 "<center><table><tr><th colsp~ (((103.9081 1.309816, 103.9082 1.30910~
# ... with 45 more rows

$regions
Simple feature collection with 5 features and 2 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
geographic CRS: WGS 84
# A tibble: 5 x 3
  Name      Description                                                 geometry
  <chr>     <chr>                                             <MULTIPOLYGON [°]>
1 CENTRAL ~ "<html xmlns:fo=\"http://ww~ (((103.8487 1.363027, 103.8482 1.36328~
2 EAST REG~ "<html xmlns:fo=\"http://ww~ (((103.9532 1.382015, 103.9528 1.38216~
3 NORTH RE~ "<html xmlns:fo=\"http://ww~ (((103.7766 1.45145, 103.7766 1.45142,~
4 NORTH-EA~ "<html xmlns:fo=\"http://ww~ (((103.8971 1.415018, 103.8971 1.41503~
5 WEST REG~ "<html xmlns:fo=\"http://ww~ (((103.6973 1.307544, 103.6973 1.30754~

$dengue_polys
Simple feature collection with 1203 features and 2 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 103.6848 ymin: 1.269624 xmax: 103.967 ymax: 1.414322
geographic CRS: WGS 84
# A tibble: 1,203 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                                    <POLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ ((103.843 1.322077, 103.843 1.323886, ~
 2 kml_2  "<center><table><tr><th colsp~ ((103.8484 1.327504, 103.8484 1.329312~
 3 kml_3  "<center><table><tr><th colsp~ ((103.8681 1.327503, 103.8681 1.329312~
 4 kml_4  "<center><table><tr><th colsp~ ((103.843 1.3474, 103.843 1.349208, 10~
 5 kml_5  "<center><table><tr><th colsp~ ((103.8448 1.3474, 103.8448 1.349208, ~
 6 kml_6  "<center><table><tr><th colsp~ ((103.8591 1.322077, 103.8591 1.323886~
 7 kml_7  "<center><table><tr><th colsp~ ((103.8699 1.322077, 103.8699 1.323886~
 8 kml_8  "<center><table><tr><th colsp~ ((103.8322 1.320269, 103.8322 1.322077~
 9 kml_9  "<center><table><tr><th colsp~ ((103.834 1.320269, 103.834 1.322077, ~
10 kml_10 "<center><table><tr><th colsp~ ((103.8555 1.320269, 103.8555 1.322077~
# ... with 1,193 more rows

$aedes_polys
Simple feature collection with 315 features and 2 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 103.6848 ymin: 1.27505 xmax: 103.9652 ymax: 1.446879
geographic CRS: WGS 84
# A tibble: 315 x 3
   Name   Description                                                   geometry
   <chr>  <chr>                                                    <POLYGON [°]>
 1 kml_1  "<center><table><tr><th colsp~ ((103.8537 1.370913, 103.8537 1.372722~
 2 kml_2  "<center><table><tr><th colsp~ ((103.8448 1.367296, 103.8448 1.369104~
 3 kml_3  "<center><table><tr><th colsp~ ((103.8358 1.378148, 103.8358 1.379957~
 4 kml_4  "<center><table><tr><th colsp~ ((103.8861 1.378147, 103.8861 1.379956~
 5 kml_5  "<center><table><tr><th colsp~ ((103.8537 1.365487, 103.8537 1.367296~
 6 kml_6  "<center><table><tr><th colsp~ ((103.8466 1.370913, 103.8466 1.372722~
 7 kml_7  "<center><table><tr><th colsp~ ((103.8717 1.37453, 103.8717 1.376339,~
 8 kml_8  "<center><table><tr><th colsp~ ((103.8843 1.376339, 103.8843 1.378147~
 9 kml_9  "<center><table><tr><th colsp~ ((103.8897 1.390808, 103.8897 1.392617~
10 kml_10 "<center><table><tr><th colsp~ ((103.8286 1.314842, 103.8286 1.316651~
# ... with 305 more rows

$mss_63station_pos
# A tibble: 63 x 9
   Station `Lat.(N)` `Long. (E)` `Period of Dail~ `Period of 30,6~
   <chr>       <dbl>       <dbl> <chr>            <chr>           
 1 Paya L~      1.35        104. Jan 1980-current <NA>            
 2 Macrit~      1.34        104. Jan 1980-current Jan 2014-current
 3 Lower ~      1.37        104. Jan 2010-current Jan 2014-current
 4 Tengah       1.39        104. Jan 1980-current <NA>            
 5 Changi       1.37        104. Jan 1981-current Jan 2014-current
 6 Seletar      1.42        104. Jan 1980-current <NA>            
 7 Pasir ~      1.39        104. Jan 1980-current Jan 2014-current
 8 Kampon~      1.27        104. Jan 1980-current Jan 2014-Oct 20~
 9 Jurong~      1.31        104. Jan 1980-current Jan 2014-current
10 Ulu Pa~      1.33        104. Jan 1980-current Jan 2014-current
# ... with 53 more rows, and 4 more variables: `Period of Mean
#   Temperature` <chr>, `Period of Max and Min Temperature` <chr>, `Period of
#   Mean Wind Speed` <chr>, `Period of Max Wind Speed` <chr>

Tidy & Transform

transpose_html_table <- function(html) {
  xml2::read_html(html) %>% 
    rvest::html_node("table") %>% 
    rvest::html_table() %>% 
    t() %>% 
    `colnames<-`(.[1,]) %>% 
    .[2,]
}

idw_interpolation <- function(points, points_label,
                              polys, polys_label,
                              ordinal = 2) {
  #' Inverse-distance-weighted (IDW) Interpolation
  #' 
  #' @description
  #' Calculate the inverse-distance-weighted (IDW) averages of the values of a 
  #' given set of points to the centroids of a given set of polygons.
  #' 
  #' The ordinal determines the amount of weight given to inverse-distances. 
  #' An ordinal of 0 will return the same value (the arithmetic mean) to all 
  #' centroids. A large ordinal will heavily penalize points that are far away. 
  #' A negative ordinal will result in distance-weighted interpolation - please 
  #' don't do that.
  #' 
  #' @param points sf object containing point features.
  #' @param polys sf object containing polygon features.
  #' @param points_label Column name of the labels in points.
  #' @param polys_label Column name of the labels in polys.
  #' @param ordinal Determines the amount of weight given to inverse-distances.
  #' @return Interpolated values for each polygon.
  
  if (ordinal < 0) {
    stop("You are highly encouraged to use a non-negative ordinal.")
  }
  
  # weights: (nrow(polys) x nrow(points))
  weights = polys %>% 
    sf::st_centroid() %>% 
    sf::st_distance(points) %>% 
    units::set_units(km) %>% 
    { 1 / (.^ordinal) }
  
  # values: (nrow(points) x ncol(values))
  values = points %>% 
    as.data.frame() %>% 
    dplyr::select(-{{ points_label }}, -geometry) %>% 
    as.matrix()
  
  # result: (nrow(polys) x ncol(values))
  result = tibble::as_tibble(weights %*% values / rowSums(weights))
  
  polys %>% 
    tibble::as_tibble() %>% 
    dplyr::select({{ polys_label }}) %>% 
    dplyr::bind_cols(result)
}

find_superset <- function(sub, super, super_label, sub_label = ".") {
  sub %>% 
    sf::st_centroid() %>% 
    sf::st_intersects(super) %>% 
    tibble::as_tibble() %>% 
    dplyr::transmute(
      {{ sub_label }} := sub[[sub_label]][row.id],
      {{ super_label }} := super[[super_label]][col.id]
    )
}

# Store intermediate processed data in this list
data <- list()

We work towards 2 datasets:

  1. data_time containing epidemiological and meteorological data from 2012 to 2020
  2. data_space containing number of dengue cases, Aedes mosquito habitats, clinics, population dwelling types, and most recent meteorological data broken down by the planning areas in Singapore

data_time

The unit of analysis for data_time is the epidemiological week.

The epidemiological data (MOH bulletin) is already organized this way.

The meteorological data is organized by days for each of 19 climate stations. To match the epidemiological data, it has to be aggregated to obtain representative values at the national level by epidemiological weeks.

In this way, how many values are we aggregating for each variable, for each week?

raw_data$mss_19stations %>% 
  tidyr::pivot_longer(Rainfall:Temp_range) %>% 
  tidyr::drop_na() %>% 
  dplyr::count(Epiyear, Epiweek, name) %>% 
  dplyr::select(name, n) %T>% 
  { print(table(.)) } %>% 
  ggplot(aes(x = n, color = name)) +
  geom_histogram(binwidth = 1) +
  facet_grid(name ~ ., scales = "free") + 
  labs(x = "Number of values", y = "Number of weeks") + 
  theme(legend.position = "none")
            n
name          46  51  56  71  73  81  86  89  90  92  94  95  96  97  98  99
  Mean_temp    1   0   0   1   0   2   1   1   3   2   2   3   2   2   4   5
  Rainfall     0   0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
  Temp_range   0   1   0   0   1   1   0   0   1   0   0   2   1   0   1   4
            n
name         100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
  Mean_temp    4   4   3   7   7   5  12  10  14  26  16  16  18   5   4   1
  Rainfall     0   0   1   0   0   1   1   0   1   0   1   1   1   1   1   0
  Temp_range   0   2   0   2   1   0   2   1   2   8   4   5   5   4   8   4
            n
name         116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131
  Mean_temp    2   8   4   6   1   1   5   6  13   9  66   2  10  11  16  23
  Rainfall     2   4   7   7   6   2   3   4  11   9  70   8  15  11  23  19
  Temp_range   4  14  11  20  16  19  19  15  25   7  51   3   2   4   5   7
            n
name         132 133
  Mean_temp   25  55
  Rainfall    38 195
  Temp_range  10 152

Most weeks will be aggregating more than 100 values for all 3 variables.

Now we join the datasets to get data_time.

data[["moh_bulletin_s"]] <- raw_data$moh_bulletin %>% 
  dplyr::select(
    Epiyear:End,
    Dengue,
    HFMD,
    `Acute Upper Respiratory Tract infections`,
    `Acute Diarrhoea`
  ) %>% 
  dplyr::rename(
    URTI = `Acute Upper Respiratory Tract infections`,
    Diarrhoea = `Acute Diarrhoea`
  )

Take a break to clean data for MOH bulletin

# Find the row(s) with error(s)
mask <- data$moh_bulletin_s$End %>% 
  { . < "2012-01-01" | . > "2021-01-03" }

# Personally inspect the erroneous row(s)
data$moh_bulletin_s[mask,]
# A tibble: 1 x 8
  Epiyear Epiweek Start               End                 Dengue  HFMD  URTI
    <dbl>   <dbl> <dttm>              <dttm>               <dbl> <dbl> <dbl>
1    2018       1 2017-12-31 00:00:00 1900-01-01 00:00:00     83   326 3158.
# ... with 1 more variable: Diarrhoea <dbl>
# Correct error(s)
data$moh_bulletin_s[mask, "End"] <- 
  data$moh_bulletin_s$Start[mask] + lubridate::days(6)

# Check correction(s)
data$moh_bulletin_s[mask,]
# A tibble: 1 x 8
  Epiyear Epiweek Start               End                 Dengue  HFMD  URTI
    <dbl>   <dbl> <dttm>              <dttm>               <dbl> <dbl> <dbl>
1    2018       1 2017-12-31 00:00:00 2018-01-06 00:00:00     83   326 3158.
# ... with 1 more variable: Diarrhoea <dbl>
data[["mss_19stations_s"]] <- raw_data$mss_19stations %>% 
  dplyr::group_by(Epiyear, Epiweek) %>% 
  dplyr::summarise(
    Mean_rainfall = mean(Rainfall, na.rm = T),
    Med_rainfall = median(Rainfall, na.rm = T),
    Mean_temp = mean(Mean_temp, na.rm = T),
    Med_temp = median(Mean_temp, na.rm = T),
    Mean_temp_rng = mean(Temp_range, na.rm = T),
    Med_temp_rng = median(Temp_range, na.rm = T),
    .groups = "drop"
  )

data_time <- dplyr::left_join(
  data$moh_bulletin_s,
  data$mss_19stations_s,
  by = c("Epiyear", "Epiweek")
) %>% 
  tidyr::drop_na()

data_time
# A tibble: 444 x 14
   Epiyear Epiweek Start               End                 Dengue  HFMD  URTI
     <dbl>   <dbl> <dttm>              <dttm>               <dbl> <dbl> <dbl>
 1    2012       1 2012-01-01 00:00:00 2012-01-07 00:00:00     74   326 2932.
 2    2012       2 2012-01-08 00:00:00 2012-01-14 00:00:00     64   346 3189.
 3    2012       3 2012-01-15 00:00:00 2012-01-21 00:00:00     60   463 3185.
 4    2012       4 2012-01-22 00:00:00 2012-01-28 00:00:00     50   372 4001.
 5    2012       5 2012-01-29 00:00:00 2012-02-04 00:00:00     84   444 3356.
 6    2012       6 2012-02-05 00:00:00 2012-02-11 00:00:00     87   683 3358.
 7    2012       7 2012-02-12 00:00:00 2012-02-18 00:00:00     65   810 3033.
 8    2012       8 2012-02-19 00:00:00 2012-02-25 00:00:00     50   989 2953.
 9    2012       9 2012-02-26 00:00:00 2012-03-03 00:00:00     55  1115 2855.
10    2012      10 2012-03-04 00:00:00 2012-03-10 00:00:00     45  1139 2868.
# ... with 434 more rows, and 7 more variables: Diarrhoea <dbl>,
#   Mean_rainfall <dbl>, Med_rainfall <dbl>, Mean_temp <dbl>, Med_temp <dbl>,
#   Mean_temp_rng <dbl>, Med_temp_rng <dbl>
dplyr::glimpse(data_time)
Rows: 444
Columns: 14
$ Epiyear       <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start         <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End           <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue        <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD          <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI          <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea     <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall  <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp     <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp      <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng  <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...

data_time contains 10 features: 4 indicators of disease numbers and 6 aggregated meteorological measures. Each value is associated with an epidemiological week. The time period spans from 2012-W01 to 2020-W30. As the meteorological records for July 2020 have not been published, there are no corresponding values from 2020-W28 onward.

data_space

A lot more wrangling is involved.

The unit of analysis for data_space is the planning area. This is specified by raw_data$planning_areas, and the other datasets would have to be transformed to match before merging.

Spatial analysis would be rather limited and more emphasis placed on visualization.

# Polygons are 200 m X 200 m squares. The `Description` contains the number 
#   of occurrences within each polygon. Use a suitable regular expression 
#   pattern to extract the number for each polygon. Find the centroid for 
#   each polygon, then duplicate each row by the number of occurrences.
data[c("dengue_points", "aedes_points")] <- 
  raw_data[c("dengue_polys", "aedes_polys")] %>% 
  lapply(function(sf_tbl_df) {
    sf_tbl_df %>% 
      dplyr::transmute(n = sub(".*: (\\d+).*", "\\1", Description)) %>% 
      sf::st_centroid() %>% 
      .[rep(1:nrow(.), as.numeric(.$n)),]
  })

data[["clinic_points"]] <- raw_data$hci_clinics %>% 
  sf::st_as_sf(coords = c("lon", "lat")) %>% 
  sf::`st_crs<-`("WGS84")

data[["weather_points"]] <- raw_data$mss_19stations %>% 
  dplyr::filter(Epiyear == 2020) %>% 
  dplyr::filter(Epiweek > max(Epiweek) - 4) %>% 
  dplyr::group_by(Station) %>% 
  dplyr::summarise(mean_rainfall = mean(Rainfall, na.rm = T),
                   med_rainfall = median(Rainfall, na.rm = T),
                   mean_temp = mean(Mean_temp, na.rm = T),
                   med_temp = median(Mean_temp, na.rm = T),
                   mean_temp_rng = mean(Temp_range, na.rm = T),
                   med_temp_rng = median(Temp_range, na.rm = T),
                   .groups = "drop") %>% 
  tidyr::drop_na() %>% 
  dplyr::left_join(
    dplyr::select(raw_data$mss_63station_pos, Station:`Long. (E)`),
    by = "Station"
  ) %>% 
  sf::st_as_sf(coords = c("Long. (E)", "Lat.(N)")) %>% 
  sf::`st_crs<-`("WGS84")

data[["planning_areas"]] <- raw_data$planning_areas %>% 
  dplyr::bind_cols(
    .$Description %>% 
      lapply(transpose_html_table) %>% 
      dplyr::bind_rows()
  ) %>% 
  tibble::as_tibble() %>% 
  dplyr::rename_all(tolower) %>% 
  dplyr::rename(plan_area = pln_area_n,
                pop = total) %>% 
  dplyr::mutate(plan_area = tools::toTitleCase(tolower(plan_area)),
                dplyr::across(pop:others, as.numeric)) %>% 
  dplyr::select(-matches("^(name|description|inc|fmel)")) %>% 
  sf::st_as_sf()

data[["regions"]] <- raw_data$regions %>% 
  dplyr::mutate(Name = tools::toTitleCase(tolower(Name)),
                Name = sub(" Region", "", Name)) %>% 
  dplyr::rename(region = Name) %>% 
  dplyr::select(-Description)

# This is long
data_space <- data$planning_areas %>% 
  dplyr::select(-one_to_two_rm:-five_rm_exec_flats, -others) %>% 
  
  # Regions
  dplyr::left_join(
    # Changi Bay and Western Islands not assigned, but inconsequential, for now
    find_superset(
      sub = .,
      super = data$regions,
      super_label = "region",
      sub_label = "plan_area"
    ),
    by = "plan_area"
  ) %>% 
  
  # Counts
  dplyr::left_join(
    list(
      ncases = data$dengue_points,
      nhabs = data$aedes_points,
      nclinics = data$clinic_points
    ) %>% {
      lapply(names(.), function(name) {
        find_superset(
          sub = .[[name]],
          super = data$planning_areas,
          super_label = "plan_area"
        ) %>% 
          dplyr::count(plan_area, name = name)
      })
    } %>% 
      Reduce(function(x, y) dplyr::left_join(x, y, by = "plan_area"), .)
  ) %>% 
  
  dplyr::select(plan_area, region, ncases, nhabs, nclinics, everything()) %>% 
  
  # Meteorological data
  dplyr::left_join(
    idw_interpolation(
      points = data$weather_points,
      points_label = "Station",
      polys = .,
      polys_label = "plan_area",
      ordinal = 10
    ),
    by = "plan_area"
  ) %>% 
  
  # Calculations
  dplyr::mutate(
    # Combine East with North-East to increase N for comparison
    region = ifelse(region == "East", "North-East", region),
    
    area_km2 = units::set_units(sf::st_area(.), km^2),
    ncases_p100k = ncases / pop * 100000,
    nclinics_p100k = nclinics / pop * 100000,
    hdb_pp = hdb / pop,
    condos_other_apts_pp = condos_other_apts / pop,
    landed_properties_pp = landed_properties / pop,
    pop_pkm2 = as.numeric(pop / area_km2),
    ncases_pkm2 = as.numeric(ncases / area_km2),
    nclinics_pkm2 = as.numeric(nclinics / area_km2),
    nhabs_pkm2 = as.numeric(nhabs / area_km2),
    
    label = paste0(
      "<b>", plan_area, "</b><br/>",
      "Area (km<sup>2</sup>): ", round(area_km2, 2), "<br/>",
      "Population: ", pop, "<br/>",
      "Cases: ", ncases, "<br/>",
      "Clinics: ", nclinics, "<br/>",
      "<i>Aedes</i> habitats: ", nhabs, "<br/>",
      "Cases (per 100k): ", round(ncases_p100k, 2), "<br/>",
      "Cases (per km<sup>2</sup>): ", round(ncases_pkm2, 2)
    )
  )

data_space

dplyr::glimpse(data_space)

data_space contains 14 main and 9 derived features for each planning area (URA MP14), including: number of dengue cases, number of Aedes mosquito habitats, number of clinics, population, various dwelling types, area, and meteorological variables. Dengue case data were unavailable for several planning areas, especially since data for the north-west region were not published.


Explore Data (Visualization)

plot_heatmap <- function(.data, y, low_color, high_color, subtitle = "") {
  .data %>% 
    dplyr::mutate(
      month = factor(
        lubridate::month(End),
        levels = as.character(1:12),
        labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                   "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"),
        ordered = TRUE
      )
    ) %>% 
    dplyr::select(Epiyear, month, {{ y }}) %>% 
    tidyr::pivot_longer({{ y }}) %>% 
    dplyr::group_by(Epiyear, month) %>% 
    dplyr::summarise(Cases = sum(value),
                     .groups = "drop") %>% 
    ggplot(aes(x = month, y = Cases, fill = Cases)) + 
    geom_col() + 
    scale_fill_gradient(low = low_color, high = high_color) +
    facet_grid(Epiyear ~ .) + 
    labs(
      title = paste0("Number of ", y, " Cases from 2012-2020"),
      subtitle = subtitle,
      x = "Month",
      y = "Cases",
      caption = "Source: Ministry of Health"
    ) + 
    theme(legend.position = "none")
}

plot_choropleths <- function(.data, .vars) {
  gridExtra::grid.arrange(grobs = lapply(.vars, function(var) {
    ggplot(aes(fill = .data[[var]]), data = .data) + 
      geom_sf() + 
      scale_fill_viridis_c(guide = guide_colourbar(
        title.position = "top",
        title.hjust = .5,
        barwidth = 10,
        barheight = .5
      )) +
      theme_void() + 
      theme(legend.position = "bottom")
  }), ncol = length(.vars))
}

plot_ggbetweenstats <- function(data, x, y, xlab, ylab) {
  type = ifelse(shapiro.test(data_space[[y]])$p.value < 0.05, "np", "p")
  
  ggstatsplot::ggbetweenstats(
    data = data,
    x = {{ x }},
    y = {{ y }},
    xlab = {{ xlab }},
    ylab = {{ ylab }},
    type = type,
    plot.type = "box",
    effsize.type = "biased",
    nboot = 50,
    pairwise.comparisons = T, 
    pairwise.display = "significant",
    pairwise.annotation = "p.value",
    p.adjust.method = "bonferroni"
  )
}

Time

Monthly Calendar Heatmaps

plot_heatmap(
  data_time,
  y = "URTI",
  low_color = "seagreen2",
  high_color = "honeydew4"
)

plot_heatmap(
  data_time,
  y = "Diarrhoea",
  low_color = "lightyellow",
  high_color = "tomato4"
)

In general, the number of cases of URTI and diarrhea are consistent throughout the years

plot_heatmap(
  data_time,
  y = "HFMD",
  low_color = "lightskyblue",
  high_color = "darkblue"
)

HFMD saw a sharp decrease from 2019.

plot_heatmap(
  data_time,
  y = "Dengue",
  low_color = "ivory3",
  high_color = "deeppink4",
  subtitle = "Highest number of Cases in 2020"
)

DF fluctuated over the years, we observed a dip in number of Dengue cases in 2017 and a sudden spike in 2020.

It is worth noting that there seems like a seasonal trend for DF which peaks in Jul – Aug. The decrease in dengue cases in 2017 could be attributed to several reasons such as, increase effort by NEA and the community in response to Zika outbreak in 2016 and perhaps the local population has built up immunity after outbreaks of dengue fever in the past. The spike in 2020 could be explained by the prolong period of time staying at home and the lack of resistance to a newly circulating strain of dengue virus.

Space

Spatial Heatmaps

plot_choropleths(data_space, c("ncases", "ncases_p100k", "ncases_pkm2"))

From the heat map, we observed that North-east region of Singapore is experiencing a higher number of dengue cases than the other parts of Singapore, with Geylang having the highest number of cases of 372.

plot_choropleths(data_space, c("nhabs", "nhabs_pkm2"))

This pattern corresponds with the number of breeding habitats in Singapore, with higher number of breeding habitats found in the North-east region.

plot_choropleths(data_space, c("mean_rainfall", "med_temp", "med_temp_rng"))

Generally, the temperature and amount of rainfall is quite homogenous in Singapore with Geylang and Hougang having slightly higher temperature than other regions; and South region seems to have more rainfall than other regions too.



ggstatsplot::ggbetweenstats()

plot_ggbetweenstats(
  data = data_space,
  x = "region",
  y = "ncases",
  xlab = "Regions in Singapore",
  ylab = "Number of cases"
)
Registered S3 method overwritten by 'broom.mixed':
  method      from 
  tidy.gamlss broom
Registered S3 methods overwritten by 'lme4':
  method                          from
  cooks.distance.influence.merMod car 
  influence.merMod                car 
  dfbeta.influence.merMod         car 
  dfbetas.influence.merMod        car 

There is a significant difference in dengue cases in Singapore in the last 14 days between the North East (\(\hat{\mu}\) = 127) and West (\(\hat{\mu}\) = 25.6) regions of Singapore (p = 0.045).

plot_ggbetweenstats(
  data = data_space,
  x = "region",
  y = "ncases_p100k",
  xlab = "Regions in Singapore",
  ylab = "Number of cases (per 100,000)"
)

There is a significant difference in dengue case incidence rates (per 100,000 population) in Singapore in the last 14 days between the Central (\(\hat{\mu}\) = 209.53) and West (\(\hat{\mu}\) = 16.64) regions of Singapore (p = 0.002).

plot_ggbetweenstats(
  data = data_space,
  x = "region",
  y = "med_temp",
  xlab = "Regions in Singapore",
  ylab = "Median temperature"
)

There are significant differences in the median temperatures in the past month between the Central (\(\hat{\mu}\) = 28.45) and North (\(\hat{\mu}\) = 27.56) regions (p = 0.017), as well as between the North and North East (\(\hat{\mu}\) = 28.49) regions (p = 0.009).

plot_ggbetweenstats(
  data = data_space,
  x = "region",
  y = "med_temp_rng",
  xlab = "Regions in Singapore",
  ylab = "Median temperature range"
)

Similar differences were present for the median temperature ranges, with differences between the Central (\(\hat{\mu}\) = 5.28) and North (\(\hat{\mu}\) = 5.93) regions (p <= 0.001), as well as between the North and North East (\(\hat{\mu}\) = 5.15) regions (p = 0.001).

data_space %>% 
  tibble::as_tibble() %>% 
  dplyr::select(ncases, nhabs, hdb:landed_properties) %>% 
  tidyr::drop_na() %>% 
  as.matrix() %>% 
  Hmisc::rcorr() %>% 
  broom::tidy() %>% 
  dplyr::filter(p.value < 0.05) %>% 
  dplyr::arrange(estimate)
# A tibble: 7 x 5
  column1           column2           estimate     n      p.value
  <chr>             <chr>                <dbl> <int>        <dbl>
1 condos_other_apts ncases               0.374    28 0.0498      
2 condos_other_apts hdb                  0.404    28 0.0329      
3 condos_other_apts nhabs                0.420    28 0.0261      
4 landed_properties ncases               0.574    28 0.00139     
5 landed_properties nhabs                0.646    28 0.000204    
6 landed_properties condos_other_apts    0.710    28 0.0000236   
7 nhabs             ncases               0.839    28 0.0000000239
data_space %>% 
  tibble::as_tibble() %>% 
  dplyr::select(ncases, nhabs, mean_rainfall, med_temp, med_temp_rng) %>% 
  tidyr::drop_na() %>% 
  as.matrix() %>% 
  Hmisc::rcorr() %>% 
  broom::tidy() %>% 
  dplyr::filter(p.value < 0.05) %>% 
  dplyr::arrange(estimate)
# A tibble: 5 x 5
  column1      column2  estimate     n      p.value
  <chr>        <chr>       <dbl> <int>        <dbl>
1 med_temp_rng med_temp   -0.618    28 0.000458    
2 med_temp_rng nhabs      -0.479    28 0.00989     
3 med_temp     nhabs       0.501    28 0.00663     
4 med_temp     ncases      0.624    28 0.000384    
5 nhabs        ncases      0.839    28 0.0000000239

Leaflet map of case density (per km2) choropleth with individual case clusters overlay.

data_space %>% 
  leaflet::leaflet(width = "100%") %>%
  leaflet::addTiles() %>%
  leaflet::addPolygons(
    weight = 1,
    opacity = 1,
    fillOpacity = 0.7,
    smoothFactor = 0.5,
    fillColor = ~leaflet::colorNumeric("Reds", ncases_pkm2)(ncases_pkm2),
    label = ~lapply(label, htmltools::HTML),
    popup = ~lapply(label, htmltools::HTML)
  ) %>%
  leaflet::addCircleMarkers(
    data = data$dengue_points,
    color = "red",
    radius = 5,
    fillOpacity = 0.5,
    clusterOptions = leaflet::markerClusterOptions()
  ) %>% 
  leaflet::addLabelOnlyMarkers(
    data = sf::st_centroid(data_space),
    label =  ~plan_area,
    labelOptions = leaflet::labelOptions(
      noHide = T,
      textOnly = T,
      direction = "center",
      style = list("color" = "blue"))
  )

Model

models <- list()

Dengue

Let’s plot the x variables (Mean_rainfall, Med_temp, Med_temp_rng) and Y (Dengue)

{par(mfrow = c(3, 2))
  hist(data_time$Dengue, breaks=40)
  hist(data_time$Mean_rainfall, breaks=40)
  hist(data_time$Mean_temp, breaks=40)
  hist(data_time$Med_temp, breaks=40)
  hist(data_time$Mean_temp_rng, breaks=40)
  hist(data_time$Med_temp_rng, breaks=40)
}

ks.test(data_time$Mean_temp, data_time$Med_temp)

    Two-sample Kolmogorov-Smirnov test

data:  data_time$Mean_temp and data_time$Med_temp
D = 0, p-value = 1
alternative hypothesis: two-sided
ks.test(data_time$Mean_temp_rng, data_time$Med_temp_rng)

    Two-sample Kolmogorov-Smirnov test

data:  data_time$Mean_temp_rng and data_time$Med_temp_rng
D = 0.065315, p-value = 0.2999
alternative hypothesis: two-sided

Mean_temp and Med_temp have similar distributions, and Mean_temp_rng and Med_temp_rng have similar distributions, therefore Med_temp and Med_temp_rng were used.

Let’s run a correlation model to see the relationship between variables

data_time %>% 
  dplyr::select(Dengue, Mean_rainfall, Med_temp, Med_temp_rng) %>%
  as.matrix() %>%
  Hmisc::rcorr() %>%
  broom::tidy() %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  dplyr::mutate(dplyr::across(is.numeric, ~round(., 3))) %>% 
  dplyr::arrange(estimate)
# A tibble: 6 x 5
  column1       column2       estimate     n p.value
  <chr>         <chr>            <dbl> <dbl>   <dbl>
1 Med_temp      Mean_rainfall   -0.499   444   0    
2 Med_temp_rng  Dengue          -0.151   444   0.001
3 Mean_rainfall Dengue          -0.052   444   0.278
4 Med_temp_rng  Med_temp         0.031   444   0.51 
5 Med_temp_rng  Mean_rainfall    0.084   444   0.077
6 Med_temp      Dengue           0.201   444   0    

Run a linear regression model

models[["model1_dengue"]] <- 
  lm(Dengue ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)

broom::tidy(models$model1_dengue)
# A tibble: 4 x 5
  term          estimate std.error statistic    p.value
  <chr>            <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -1237.      378.       -3.27 0.00114   
2 Mean_rainfall     4.01      2.46      1.63 0.104     
3 Med_temp         61.9      13.2       4.70 0.00000355
4 Med_temp_rng    -43.3      12.1      -3.59 0.000369  

Let’s check the skewness of y variable

e1071::skewness(data_time$Dengue)  # [1] 2.033787
[1] 2.033787

Let’s see how the density plots look like after transformation

{
  par(mfrow = c(2, 2))
  plot(density(data_time$Dengue, na.rm = T), main = "untransformed")
  plot(density(sqrt(data_time$Dengue), na.rm = T), main = "sqrt")
  plot(density(log10(data_time$Dengue), na.rm = T), main = "log10")
  plot(density(1 / data_time$Dengue, na.rm = T), main = "inverse")
}

Not really good. Try them on Model 1

models[["model1_log_dengue"]] <- 
  lm(log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)

broom::tidy(models$model1_log_dengue)
# A tibble: 4 x 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   -0.309     0.700      -0.441 0.659    
2 Mean_rainfall  0.00294   0.00456     0.644 0.520    
3 Med_temp       0.0993    0.0244      4.06  0.0000572
4 Med_temp_rng  -0.0407    0.0224     -1.82  0.0695   
gvlma::gvlma(models$model1_log_dengue)

Call:
lm(formula = log10(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    -0.309087       0.002939       0.099330      -0.040711  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_log_dengue) 

                     Value   p-value                   Decision
Global Stat        28.1144 1.182e-05 Assumptions NOT satisfied!
Skewness            1.9927 1.581e-01    Assumptions acceptable.
Kurtosis           17.8056 2.447e-05 Assumptions NOT satisfied!
Link Function       0.1611 6.882e-01    Assumptions acceptable.
Heteroscedasticity  8.1550 4.294e-03 Assumptions NOT satisfied!
models[["model1_sqrt_dengue"]] <- 
  lm(sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)

broom::tidy(models$model1_sqrt_dengue)
# A tibble: 4 x 5
  term          estimate std.error statistic    p.value
  <chr>            <dbl>     <dbl>     <dbl>      <dbl>
1 (Intercept)   -28.7      11.2        -2.57 0.0105    
2 Mean_rainfall   0.0782    0.0727      1.08 0.283     
3 Med_temp        1.74      0.390       4.47 0.00000986
4 Med_temp_rng   -0.974     0.357      -2.73 0.00655   
gvlma::gvlma(models$model1_sqrt_dengue)

Call:
lm(formula = sqrt(Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    -28.71254        0.07821        1.74274       -0.97433  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_sqrt_dengue) 

                     Value   p-value                   Decision
Global Stat        32.7985 1.313e-06 Assumptions NOT satisfied!
Skewness           22.7656 1.830e-06 Assumptions NOT satisfied!
Kurtosis            0.2636 6.077e-01    Assumptions acceptable.
Link Function       0.5967 4.398e-01    Assumptions acceptable.
Heteroscedasticity  9.1726 2.457e-03 Assumptions NOT satisfied!
models[["model1_inv_dengue"]] <- 
  lm((1 / Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time)

broom::tidy(models$model1_inv_dengue)
# A tibble: 4 x 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)    0.0529    0.0141        3.76  0.000193
2 Mean_rainfall -0.0000488 0.0000916    -0.533 0.594   
3 Med_temp      -0.00164   0.000491     -3.34  0.000896
4 Med_temp_rng   0.000310  0.000449      0.689 0.491   
gvlma::gvlma(models$model1_inv_dengue)

Call:
lm(formula = (1/Dengue) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
    0.0528561     -0.0000488     -0.0016408      0.0003096  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_inv_dengue) 

                     Value   p-value                   Decision
Global Stat        200.447 0.000e+00 Assumptions NOT satisfied!
Skewness           133.155 0.000e+00 Assumptions NOT satisfied!
Kurtosis            41.572 1.136e-10 Assumptions NOT satisfied!
Link Function        1.142 2.853e-01    Assumptions acceptable.
Heteroscedasticity  24.578 7.135e-07 Assumptions NOT satisfied!

Let’s have a look at the models’ adjusted R2 values

list(
  "untransformed" = models$model1_dengue,
  "log" = models$model1_log_dengue,
  "sqrt" = models$model1_sqrt_dengue,
  "inv" = models$model1_inv_dengue
) %>% 
  lapply(broom::glance) %>% 
  { dplyr::bind_cols(names(.), dplyr::bind_rows(.)) } %>% 
  dplyr::rename(transformation = ...1)
New names:
* NA -> ...1
# A tibble: 4 x 13
  transformation r.squared adj.r.squared   sigma statistic p.value    df logLik
  <chr>              <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>
1 untransformed     0.0709        0.0646 2.02e+2     11.2  4.34e-7     3 -2985.
2 log               0.0472        0.0407 3.75e-1      7.26 9.19e-5     3  -193.
3 sqrt              0.0599        0.0535 5.98e+0      9.34 5.34e-6     3 -1422.
4 inv               0.0292        0.0226 7.53e-3      4.41 4.53e-3     3  1543.
# ... with 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>,
#   df.residual <int>, nobs <int>

It seems that models$model1_dengue (untransformed) has a higher adjusted R2 value than the other models, and it explains about 6% of the variance.

Plot the regression models

data_time %>% 
  dplyr::select(Dengue, Med_temp, Med_temp_rng, Mean_rainfall) %>% 
  # scale() %>% 
  tibble::as_tibble() %>% 
  tidyr::pivot_longer(-Dengue) %>% 
  ggplot(aes(x = value, y = Dengue, color = name)) +
  geom_point(alpha = 0.4) + 
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  facet_grid(. ~ name, scales = "free") + 
  theme(legend.position = "none")

24 hr temperature mean per month is 27.5 mean Rainfall is 13.8

Since rainfall has marginal significance at an alpha of 10% and known to have effect on vector born diseases, we decided to explore rainfall variable by categorizing it (<11mm of rain fall and >=11mm of rainfall)

data_time <- data_time %>% 
  dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))

dplyr::glimpse(data_time)
Rows: 444
Columns: 15
$ Epiyear       <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start         <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End           <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue        <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD          <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI          <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea     <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall  <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp     <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp      <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng  <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
$ Rain_11       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,...

Re-Run Model1 With categorized Rain variable

models[["model2_dengue"]] <- 
  lm(Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)

gvlma::gvlma(models$model2_dengue)

Call:
lm(formula = Dengue ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time)

Coefficients:
 (Intercept)       Rain_11      Med_temp  Med_temp_rng  
    -1219.75         62.59         61.56        -42.21  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model2_dengue) 

                     Value   p-value                   Decision
Global Stat        803.970 0.000e+00 Assumptions NOT satisfied!
Skewness           218.166 0.000e+00 Assumptions NOT satisfied!
Kurtosis           556.830 0.000e+00 Assumptions NOT satisfied!
Link Function        2.482 1.151e-01    Assumptions acceptable.
Heteroscedasticity  26.493 2.645e-07 Assumptions NOT satisfied!
broom::tidy(models$model2_dengue)
# A tibble: 4 x 5
  term         estimate std.error statistic     p.value
  <chr>           <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)   -1220.      351.      -3.48 0.000556   
2 Rain_11          62.6      28.1      2.22 0.0266     
3 Med_temp         61.6      12.3      5.01 0.000000805
4 Med_temp_rng    -42.2      12.0     -3.53 0.000463   

Add Interaction term

models[["model2_dengue_int"]] <- 
  lm(Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)

gvlma::gvlma(models$model2_dengue_int)

Call:
lm(formula = Dengue ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time)

Coefficients:
         (Intercept)              Med_temp          Med_temp_rng  
            -1187.07                 56.90                -26.90  
             Rain_11      Med_temp:Rain_11  Med_temp_rng:Rain_11  
            -1861.85                 97.09               -111.60  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model2_dengue_int) 

                    Value   p-value                   Decision
Global Stat        513.22 0.000e+00 Assumptions NOT satisfied!
Skewness           166.77 0.000e+00 Assumptions NOT satisfied!
Kurtosis           310.38 0.000e+00 Assumptions NOT satisfied!
Link Function       16.41 5.092e-05 Assumptions NOT satisfied!
Heteroscedasticity  19.65 9.293e-06 Assumptions NOT satisfied!
broom::tidy(models$model2_dengue_int)
# A tibble: 6 x 5
  term                 estimate std.error statistic   p.value
  <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)           -1187.      382.      -3.10 0.00204  
2 Med_temp                 56.9      13.1      4.34 0.0000179
3 Med_temp_rng            -26.9      13.3     -2.02 0.0435   
4 Rain_11               -1862.     1006.      -1.85 0.0649   
5 Med_temp:Rain_11         97.1      39.4      2.46 0.0142   
6 Med_temp_rng:Rain_11   -112.       32.7     -3.41 0.000702 
anova(models$model2_dengue,models$model2_dengue_int)
Analysis of Variance Table

Model 1: Dengue ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: Dengue ~ (Med_temp + Med_temp_rng) * Rain_11
  Res.Df      RSS Df Sum of Sq      F   Pr(>F)   
1    440 17909105                                
2    438 17396320  2    512786 6.4554 0.001726 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(models$model2_dengue)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1    0.0757        0.0694  202.      12.0 1.44e-7     3 -2984. 5979. 5999.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(models$model2_dengue_int)
# A tibble: 1 x 12
  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.102        0.0919  199.      9.97 4.82e-9     5 -2978. 5970. 5998.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

There is an interaction effect (Rain_11 on 2 other predictors) and Anova test shows adding the interaction terms improves the model fit (p value=0.0017). This interaction model has a higher R2 value too.

r.squared(models$Model2_dengue) –> 0.0757
r.squared(models$Model2_dengue_int) –> 0.102
adj.r.squared(models$Model2_dengue) –> 0.0694
adj.r.squared(models$Model2_dengue_int) –> 0.0919

Unfortunately, all model assumptions failed. So This model won’t be a good one for prediction.

Lets explore this interaction model

jtools::export_summs(
  models$model2_dengue,
  models$model2_dengue_int,
  error_format = "(p = {p.value})",
  model.names = c("Main Effects Only", "Interaction Effects"),
  digits = 3
)
Main Effects OnlyInteraction Effects
(Intercept)-1219.750 ***-1187.070 ** 
(p = 0.001)   (p = 0.002)   
Rain_1162.587 *  -1861.847    
(p = 0.027)   (p = 0.065)   
Med_temp61.565 ***56.900 ***
(p = 0.000)   (p = 0.000)   
Med_temp_rng-42.213 ***-26.895 *  
(p = 0.000)   (p = 0.044)   
Med_temp:Rain_11        97.086 *  
        (p = 0.014)   
Med_temp_rng:Rain_11        -111.602 ***
        (p = 0.001)   
N444        444        
R20.076    0.102    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Probing Interaction Effects

Run Spotlight analysis, using Johnson-Neyman Techniques

Median temperature

interactions::sim_slopes(
  models$model2_dengue_int,
  pred = Rain_11,
  modx = Med_temp, 
  johnson_neyman = T
)  # [23.43, 27.15]
JOHNSON-NEYMAN INTERVAL 

When Med_temp is OUTSIDE the interval [23.43, 27.15], the slope of Rain_11
is p < .05.

Note: The range of observed values of Med_temp is [25.06, 29.96]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp = 27.11 (- 1 SD): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  52.49   28.96     1.81   0.07

Slope of Rain_11 when Med_temp = 27.96 (Mean): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  134.46   38.04     3.53   0.00

Slope of Rain_11 when Med_temp = 28.80 (+ 1 SD): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  216.42   65.35     3.31   0.00
interactions::interact_plot(
  models$model2_dengue_int,
  pred="Med_temp",
  modx = "Rain_11",
  modx.labels = c("less than 11mm of rain fall", "rain fall >=11mm"),
  interval = T,
  int.width = .95,
  colors = c("deeppink", "darkgreen"),
  vary.lty = T,
  line.thickness = 1,
  legend.main = ""
) + 
  geom_vline(xintercept = 27.15, col = "red", linetype = 3, size = 1) + 
  labs(
    title = "Effect of Rain fall and Temperature on Dengue cases in Singapore",
    subtitle = paste0(
      "With an average weekly rainfall of 11mm and above, the number of ",
      "Dengue cases increases with rise in temperature"
    ),
    x = "Weekly Median Temperature",
    y = "Number of Dengue Cases",
    caption = "Source: MOH, NEA Websites"
  ) + 
  annotate(
    "text", 
    x = 28, 
    y = 0,
    label = paste(
      "The shaded areas denote 95% confidence intervals.",
      "The vertical line marks the boundary between",
      "regions of significance and non-significance",
      "based on alpha at 5%",
      sep = "\n"
    )
  ) + 
  theme(legend.position = "top")

Median Temperature Range

interactions::sim_slopes(
  models$model2_dengue_int,
  pred = Rain_11,
  modx = Med_temp_rng,
  johnson_neyman = T
)  # [7.01, 9.04]
JOHNSON-NEYMAN INTERVAL 

When Med_temp_rng is OUTSIDE the interval [7.01, 9.04], the slope of
Rain_11 is p < .05.

Note: The range of observed values of Med_temp_rng is [3.50, 8.50]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp_rng = 5.63 (- 1 SD): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  223.99   53.48     4.19   0.00

Slope of Rain_11 when Med_temp_rng = 6.43 (Mean): 

    Est.    S.E.   t val.      p
-------- ------- -------- ------
  134.46   38.04     3.53   0.00

Slope of Rain_11 when Med_temp_rng = 7.24 (+ 1 SD): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  44.92   37.56     1.20   0.23
interactions::interact_plot(
  models$model2_dengue_int,
  pred="Med_temp_rng",
  modx = "Rain_11",
  modx.labels = c("less than 11mm of rain fall", "rain fall >=11mm"), 
  interval = T,
  int.width = .95,
  colors = c("deeppink", "darkgreen"),
  vary.lty = T,
  line.thickness = 1,
  legend.main = "Rain Fall"
) + 
  geom_vline(xintercept = 7.01, col = "red", linetype = 3, size = 1) + 
  labs(
    title = paste0(
      "Effect of Rain fall and Temperature Variation ",
      "on Dengue Cases in Singapore"
    ),
    subtitle = paste0(
      "With an average weekly rainfall of 11mm and above, the number of ",
      "Dengue cases decreases when tempeature variation is HIGH"
    ), 
    x = "Weekly Median Temperature Range",
    y = "Number of Dengue Cases",
    caption = "Source: MOH, NEA Websites"
  ) + 
  annotate(
    "text",
    x = 6,
    y = 100,
    label = paste(
      "The shaded areas denote 95% confidence intervals.",
      "The vertical line marks the boundary between",
      "regions of significance and non-significance",
      "based on alpha at 5%",
      sep = "\n"
    )
  ) + 
  theme(legend.position = "top")

Upper Respiratory Tract Infections (URTI)

Since the number of URTI cases in Singapore drastically reduced due to COVID-19, we decided to exclude the values since April 2020 (from circuit breaker period), for this analysis.

Let’s plot the x variables (Mean_rainfall, Med_temp, Med_temp_rng) and Y (URTI)

data_time_URTI <- data_time %>% 
  dplyr::filter(URTI >= 2000)

{
  par(mfrow = c(2, 2))
  hist(data_time_URTI$Mean_rainfall, breaks=40)
  hist(data_time_URTI$Med_temp, breaks=40)
  hist(data_time_URTI$Med_temp_rng, breaks=40)
  hist(data_time_URTI$URTI, breaks=40)
}

Let’s run a correlation model to see the relationship between variables

data_time_URTI %>% 
  dplyr::select(URTI, Mean_rainfall, Med_temp, Med_temp_rng) %>%
  as.matrix() %>%
  Hmisc::rcorr() %>%
  broom::tidy() %>% 
  # dplyr::filter(p.value < 0.05) %>% 
  dplyr::mutate(dplyr::across(is.numeric, ~round(., 3))) %>% 
  dplyr::arrange(estimate)
# A tibble: 6 x 5
  column1       column2       estimate     n p.value
  <chr>         <chr>            <dbl> <dbl>   <dbl>
1 Med_temp      Mean_rainfall   -0.508   429   0    
2 Med_temp      URTI            -0.195   429   0    
3 Med_temp_rng  URTI            -0.086   429   0.075
4 Mean_rainfall URTI            -0.042   429   0.383
5 Med_temp_rng  Med_temp         0.027   429   0.577
6 Med_temp_rng  Mean_rainfall    0.091   429   0.06 

Run a simple linear regression model

models[["model1_URTI"]] <- 
  lm(URTI ~ Mean_rainfall + Med_temp + Med_temp_rng, data = data_time_URTI)

broom::tidy(models$model1_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     6804.     729.        9.34 5.59e-19
2 Mean_rainfall    -15.7      4.74     -3.31 1.01e- 3
3 Med_temp        -133.      25.5      -5.22 2.75e- 7
4 Med_temp_rng     -30.5     23.2      -1.31 1.91e- 1
gvlma::gvlma(models$model1_URTI)

Call:
lm(formula = URTI ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = data_time_URTI)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
      6804.45         -15.68        -132.97         -30.45  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_URTI) 

                     Value   p-value                   Decision
Global Stat        119.874 0.000e+00 Assumptions NOT satisfied!
Skewness            48.234 3.782e-12 Assumptions NOT satisfied!
Kurtosis            48.947 2.630e-12 Assumptions NOT satisfied!
Link Function        3.667 5.550e-02    Assumptions acceptable.
Heteroscedasticity  19.026 1.290e-05 Assumptions NOT satisfied!

Assumption checks failed

Let’s check the skewness of y variable

e1071::skewness(data_time_URTI$URTI)  # [1] 0.8634216
[1] 0.8634216
#Lets see how the density plots look like after transformation
{
  par(mfrow = c(2, 2))
  plot(density(data_time_URTI$URTI), main = "untransformed")
  plot(density(sqrt(data_time_URTI$URTI)), main = "sqrt")
  plot(density(log10(data_time_URTI$URTI)), main = "log10")
  plot(density(1 / data_time_URTI$URTI), main = "inverse")
}

Try them on Model 1

models[["model1_log_URTI"]] <- data_time_URTI %>% 
  lm(log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)

broom::tidy(models$model1_log_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic   p.value
  <chr>            <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)    4.03     0.110        36.8  3.38e-134
2 Mean_rainfall -0.00232  0.000713     -3.26 1.22e-  3
3 Med_temp      -0.0195   0.00383      -5.09 5.50e-  7
4 Med_temp_rng  -0.00523  0.00350      -1.50 1.35e-  1
gvlma::gvlma(models$model1_log_URTI)

Call:
lm(formula = log10(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = .)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
     4.034784      -0.002320      -0.019479      -0.005233  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_log_URTI) 

                    Value   p-value                   Decision
Global Stat        22.357 0.0001701 Assumptions NOT satisfied!
Skewness            8.284 0.0039991 Assumptions NOT satisfied!
Kurtosis            1.968 0.1606558    Assumptions acceptable.
Link Function       3.159 0.0755032    Assumptions acceptable.
Heteroscedasticity  8.946 0.0027807 Assumptions NOT satisfied!
models[["model1_sqrt_URTI"]] <- data_time_URTI %>% 
  lm(sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)

broom::tidy(models$model1_sqrt_URTI)
# A tibble: 4 x 5
  term          estimate std.error statistic  p.value
  <chr>            <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)     89.6      6.76       13.3  7.25e-34
2 Mean_rainfall   -0.144    0.0439     -3.29 1.09e- 3
3 Med_temp        -1.22     0.236      -5.16 3.73e- 7
4 Med_temp_rng    -0.303    0.215      -1.41 1.60e- 1
gvlma::gvlma(models$model1_sqrt_URTI)

Call:
lm(formula = sqrt(URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = .)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
      89.6374        -0.1444        -1.2182        -0.3032  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_sqrt_URTI) 

                    Value   p-value                   Decision
Global Stat        52.462 1.104e-10 Assumptions NOT satisfied!
Skewness           23.091 1.545e-06 Assumptions NOT satisfied!
Kurtosis           12.591 3.877e-04 Assumptions NOT satisfied!
Link Function       3.415 6.462e-02    Assumptions acceptable.
Heteroscedasticity 13.365 2.563e-04 Assumptions NOT satisfied!
models[["model1_inv_URTI"]] <- data_time_URTI %>% 
  lm((1 / URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, data = .)

broom::tidy(models$model1_inv_URTI)
# A tibble: 4 x 5
  term             estimate   std.error statistic    p.value
  <chr>               <dbl>       <dbl>     <dbl>      <dbl>
1 (Intercept)   -0.000108   0.0000902       -1.20 0.231     
2 Mean_rainfall  0.00000185 0.000000586      3.16 0.00168   
3 Med_temp       0.0000154  0.00000315       4.89 0.00000143
4 Med_temp_rng   0.00000475 0.00000288       1.65 0.0997    
gvlma::gvlma(models$model1_inv_URTI)

Call:
lm(formula = (1/URTI) ~ Mean_rainfall + Med_temp + Med_temp_rng, 
    data = .)

Coefficients:
  (Intercept)  Mean_rainfall       Med_temp   Med_temp_rng  
   -1.083e-04      1.854e-06      1.541e-05      4.745e-06  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model1_inv_URTI) 

                     Value p-value                Decision
Global Stat        6.13667 0.18917 Assumptions acceptable.
Skewness           0.31090 0.57713 Assumptions acceptable.
Kurtosis           0.03011 0.86224 Assumptions acceptable.
Link Function      2.65447 0.10326 Assumptions acceptable.
Heteroscedasticity 3.14119 0.07634 Assumptions acceptable.

Assumptions acceptable

Plot the regression model 1

data_time_URTI %>% 
  dplyr::mutate(URTI_inv = (1 / URTI)) %>% 
  dplyr::select(URTI_inv, Med_temp, Med_temp_rng, Mean_rainfall) %>% 
  # scale() %>% 
  tibble::as_tibble() %>% 
  tidyr::pivot_longer(-URTI_inv) %>% 
  ggplot(aes(x = value, y = URTI_inv, color = name)) +
  geom_point(alpha = 0.4) + 
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") + 
  facet_grid(. ~ name, scales = "free") + 
  theme(legend.position = "none")

From the plot, there are more cases when rain fall is lesser. So we decided to explore rainfall variable by categorizing it (<11cm of rain fall and >=11cm of rainfall).

data_time_URTI <- data_time_URTI %>% 
  dplyr::mutate(Rain_11 = ifelse(Mean_rainfall >= 11, 1, 0))

dplyr::glimpse(data_time_URTI)
Rows: 429
Columns: 15
$ Epiyear       <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,...
$ Epiweek       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
$ Start         <dttm> 2012-01-01, 2012-01-08, 2012-01-15, 2012-01-22, 2012...
$ End           <dttm> 2012-01-07, 2012-01-14, 2012-01-21, 2012-01-28, 2012...
$ Dengue        <dbl> 74, 64, 60, 50, 84, 87, 65, 50, 55, 45, 64, 72, 48, 5...
$ HFMD          <dbl> 326, 346, 463, 372, 444, 683, 810, 989, 1115, 1139, 1...
$ URTI          <dbl> 2932.222, 3188.727, 3184.545, 4000.571, 3355.818, 335...
$ Diarrhoea     <dbl> 490.8889, 575.0909, 538.9091, 614.5714, 558.5455, 556...
$ Mean_rainfall <dbl> 4.939850e-01, 4.386466e+00, 9.695489e+00, 4.429323e+0...
$ Med_rainfall  <dbl> 0.0, 1.7, 0.2, 0.0, 0.0, 0.0, 1.2, 0.0, 1.4, 1.4, 0.3...
$ Mean_temp     <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Med_temp      <dbl> 27.35469, 26.09847, 27.29297, 26.66406, 26.49699, 27....
$ Mean_temp_rng <dbl> 6.498496, 5.030827, 7.760150, 6.642105, 6.358647, 7.5...
$ Med_temp_rng  <dbl> 6.4, 4.5, 7.7, 6.8, 6.4, 7.5, 6.5, 6.7, 6.8, 6.7, 6.5...
$ Rain_11       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,...

Re-Run Model1 With categorised Rain variable

models[["model2_inv_URTI"]] <- 
  lm((1 / URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)

gvlma::gvlma(models$model2_inv_URTI)

Call:
lm(formula = (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng, data = data_time_URTI)

Coefficients:
 (Intercept)       Rain_11      Med_temp  Med_temp_rng  
  -3.362e-05     1.536e-05     1.290e-05     5.447e-06  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model2_inv_URTI) 

                     Value p-value                Decision
Global Stat        7.11459  0.1300 Assumptions acceptable.
Skewness           0.06411  0.8001 Assumptions acceptable.
Kurtosis           0.01816  0.8928 Assumptions acceptable.
Link Function      3.41266  0.0647 Assumptions acceptable.
Heteroscedasticity 3.61966  0.0571 Assumptions acceptable.
broom::tidy(models$model2_inv_URTI)
# A tibble: 4 x 5
  term            estimate  std.error statistic   p.value
  <chr>              <dbl>      <dbl>     <dbl>     <dbl>
1 (Intercept)  -0.0000336  0.0000842     -0.399 0.690    
2 Rain_11       0.0000154  0.00000683     2.25  0.0251   
3 Med_temp      0.0000129  0.00000296     4.36  0.0000162
4 Med_temp_rng  0.00000545 0.00000288     1.89  0.0589   

Add Interaction term

models[["model2_inv_URTI_int"]] <- 
  lm((1 / URTI) ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time_URTI)

gvlma::gvlma(models$model2_inv_URTI_int)

Call:
lm(formula = (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11, 
    data = data_time_URTI)

Coefficients:
         (Intercept)              Med_temp          Med_temp_rng  
          -1.161e-04             1.590e-05             5.151e-06  
             Rain_11      Med_temp:Rain_11  Med_temp_rng:Rain_11  
           8.132e-04            -3.302e-05             1.568e-05  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma::gvlma(x = models$model2_inv_URTI_int) 

                    Value p-value                Decision
Global Stat        4.8987 0.29786 Assumptions acceptable.
Skewness           0.1427 0.70559 Assumptions acceptable.
Kurtosis           0.3799 0.53765 Assumptions acceptable.
Link Function      0.7336 0.39173 Assumptions acceptable.
Heteroscedasticity 3.6424 0.05632 Assumptions acceptable.
broom::tidy(models$model2_inv_URTI_int)
# A tibble: 6 x 5
  term                    estimate  std.error statistic     p.value
  <chr>                      <dbl>      <dbl>     <dbl>       <dbl>
1 (Intercept)          -0.000116   0.0000917      -1.27 0.206      
2 Med_temp              0.0000159  0.00000315      5.05 0.000000645
3 Med_temp_rng          0.00000515 0.00000318      1.62 0.106      
4 Rain_11               0.000813   0.000254        3.21 0.00144    
5 Med_temp:Rain_11     -0.0000330  0.0000101      -3.27 0.00116    
6 Med_temp_rng:Rain_11  0.0000157  0.00000837      1.87 0.0617     
anova(models$model2_inv_URTI,models$model2_inv_URTI_int)
Analysis of Variance Table

Model 1: (1/URTI) ~ Rain_11 + Med_temp + Med_temp_rng
Model 2: (1/URTI) ~ (Med_temp + Med_temp_rng) * Rain_11
  Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
1    425 9.6001e-07                                 
2    423 9.3611e-07  2 2.3896e-08 5.3988 0.004839 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::glance(models$model2_inv_URTI)
# A tibble: 1 x 12
  r.squared adj.r.squared   sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1    0.0534        0.0467 4.75e-5      7.99 3.43e-5     3  3664. -7317. -7297.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
broom::glance(models$model2_inv_URTI_int)
# A tibble: 1 x 12
  r.squared adj.r.squared   sigma statistic p.value    df logLik    AIC    BIC
      <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1    0.0769        0.0660 4.70e-5      7.05 2.40e-6     5  3669. -7324. -7296.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Median temperature is significant at an alpha level of 5%. There is an interaction effect too (Rain_11 on Med_temp) and Anova test shows adding the interaction terms improves the model fit (p value = 0.004838). This interaction model has a higher R2 value too.

r.squared(models$model2_inv_URTI) –> 0.053
r.squared(models$model2_inv_URTI_int) –> 0.077
adj.r.squared(models$model2_inv_URTI) –> 0.0467
adj.r.squared(models$model2_inv_URTI_int) –> 0.0660

jtools::export_summs(
  models$model2_inv_URTI,models$model2_inv_URTI_int,
  error_format = "(p = {p.value})",
  model.names = c("Main Effects Only", "Interaction Effects"),
  digits = 3
)
Main Effects OnlyInteraction Effects
(Intercept)-0.000    -0.000    
(p = 0.690)   (p = 0.206)   
Rain_110.000 *  0.001 ** 
(p = 0.025)   (p = 0.001)   
Med_temp0.000 ***0.000 ***
(p = 0.000)   (p = 0.000)   
Med_temp_rng0.000    0.000    
(p = 0.059)   (p = 0.106)   
Med_temp:Rain_11        -0.000 ** 
        (p = 0.001)   
Med_temp_rng:Rain_11        0.000    
        (p = 0.062)   
N429        429        
R20.053    0.077    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Probing Interaction Effects

# Run Spotlight analysis, using Johnson-Neyman Techniques
# Median Temperature

# interactions::sim_slopes(
#   models$model2_inv_URTI_int,
#   pred = Rain_11,
#   modx = Med_temp,
#   johnson_neyman = T
# )  # [27.28, 28.61]

# Run interaction_plot() by adding benchmark for regions of significance
# Use plot B to see the actual URTI numbers instead of y inverse

# interactions::interact_plot(
#   models$model2_inv_URTI_int,
#   data = data_time_URTI,
#   pred = "Med_temp",
#   modx = "Rain_11",
#   modx.labels = c("less than 11cm of rain fall", "rain fall >=11cm"),
#   interval = T,
#   int.width = .95,
#   colors = c("deeppink","darkgreen"),
#   vary.lty = T,
#   line.thickness = 1,
#   legend.main = ""
# ) + 
#   geom_vline(xintercept = 27.28, col = "red", linetype = 3, size = 1) + 
#   geom_vline(xintercept = 28.61, col = "red", linetype = 3, size = 1) + 
#   labs(
#     title = paste0(
#       "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore"
#     ),
#     subtitle = paste0(
#       "For lower weekly rainfall, the number of URTI cases ",
#       "decreases with rise in temperature"
#     ),
#     x = "Weekly Median Temperature",
#     y = "1/Number of URTI Cases",
#     caption = "Source: MOH, NEA websites"
#   ) + 
#   annotate(
#     "text",
#     x = 28,
#     y = 0.0003,
#     label = paste(
#       "The shaded areas denote 95% confidence intervals.",
#       "The vertical lines mark the boundary between",
#       "regions of significance and non-significance",
#       "based on alpha at 5%",
#       sep = "\n"
#     )
#   ) + 
#   theme(legend.position = "top")
##############################################
#This model model2_URTI is just for plotting purpose.
models[["model2_URTI_int"]] <- 
  lm(URTI ~ (Med_temp + Med_temp_rng) * Rain_11, data = data_time_URTI)

interactions::sim_slopes(
  models$model2_URTI_int,
  pred = Rain_11,
  modx = Med_temp,
  johnson_neyman = T
)  # [27.24, 28.48]
JOHNSON-NEYMAN INTERVAL 

When Med_temp is OUTSIDE the interval [27.24, 28.48], the slope of Rain_11
is p < .05.

Note: The range of observed values of Med_temp is [25.06, 29.96]

SIMPLE SLOPES ANALYSIS 

Slope of Rain_11 when Med_temp = 27.09 (- 1 SD): 

     Est.    S.E.   t val.      p
--------- ------- -------- ------
  -149.99   56.54    -2.65   0.01

Slope of Rain_11 when Med_temp = 27.94 (Mean): 

   Est.    S.E.   t val.      p
------- ------- -------- ------
  79.97   79.73     1.00   0.32

Slope of Rain_11 when Med_temp = 28.78 (+ 1 SD): 

    Est.     S.E.   t val.      p
-------- -------- -------- ------
  309.93   137.96     2.25   0.03
interactions::interact_plot(
  models$model2_URTI_int,
  pred="Med_temp",
  modx = "Rain_11",
  modx.labels = c("less than 11cm of rain fall", "rain fall >=11cm"),
  interval = T,
  int.width = .95,
  colors = c("deeppink", "darkgreen"),
  vary.lty = T,
  line.thickness = 1,
  legend.main = ""
) +
  geom_vline(xintercept = 27.24, col = "red", linetype = 3, size = 1) + 
  geom_vline(xintercept = 28.48, col = "red", linetype = 3, size = 1) + 
  labs(
    title = paste0(
      "Effect of Rain fall and Temperature on rise in URTI Cases in Singapore"
    ),
    subtitle = paste0(
      "For lower weekly rainfall, the number of URTI cases ",
      "decreases with rise in temperature"
    ),
    x = "Weekly Median Temperature",
    y = "Number of URTI Cases",
    caption = "Source: MOH, NEA websites"
  ) +
  annotate(
    "text", 
    x = 28, 
    y = 2000,
    label = paste(
      "The shaded areas denote 95% confidence intervals.",
      "The vertical lines mark the boundary between",
      "regions of significance and non-significance",
      "based on alpha at 5%",
      sep = "\n"
    )
  ) + 
  theme(legend.position = "top")


Interpretation of the Results

Associations between number of DF and URTI in the community with Rainfall and Temperature

In the modeling analyses, all models for the 4 clinical conditions except URTI failed the assumption checks. Despite failing the checks, we continued to test several modeling for DF as we are interested in the disease. The results below will be focusing on URTI and DF. We observed the following results:

  1. In low rainfall (< 11mm), number of URTI cases seemed to increase when weekly median temperature is lower. Whilst in high rainfall (> 11mm), number of URTI cases decreased when weekly median temperature is lower.
  2. In high rainfall (> 11mm), there is an increase in the number of cases for both DF and URTI when weekly median air temperature is higher.
  3. Cases of DF seemed to decrease in high rainfall and low weekly median temperature.

Dengue cases in Singapore in the last 14 days

  1. There is a significant difference in dengue cases in Singapore in the last 14 days between the North East (\(\hat{\mu}\) = 127) and West (\(\hat{\mu}\) = 25.6) regions of Singapore (p = 0.045).
  2. There is a significant difference in dengue case incidence rates (per 100,000 population) in Singapore in the last 14 days between the Central (\(\hat{\mu}\) = 209.53) and West (\(\hat{\mu}\) = 16.64) regions of Singapore (p = 0.002).
  3. There are significant differences in the median temperatures in the past month between the Central (\(\hat{\mu}\) = 28.45) and North (\(\hat{\mu}\) = 27.56) regions (p = 0.017), as well as between the North and North East (\(\hat{\mu}\) = 28.49) regions (p = 0.009).
  4. Similar differences were present for the median temperature ranges, with differences between the Central (\(\hat{\mu}\) = 5.28) and North (\(\hat{\mu}\) = 5.93) regions (p <= 0.001), as well as between the North and North East (\(\hat{\mu}\) = 5.15) regions (p = 0.001).

Implications

Associations between number of DF and URTI in the community with Rainfall and Temperature

It is expected that there will be an increase in URTI cases when the air temperature is low as people are more susceptible to common cold during cold weather. We observed that the number of URTI cases increase when the air temperature is low with low rainfall but a decrease in cases during low temperature with high rainfall. One of the reasons for this phenomenon could be people may wear more layers when the rain is heavier to protect them from the rain and cold.

However, it is interesting to note that there is an increase in number of cases for DF and URTI during high air temperature with high rainfall (>11mm). The increase in DF in warmer temperature is likely attributed to the optimal temperature for mosquito to develop, research has shown that mosquito replicates faster at higher temperature during incubation period (Liu et al., 2017). The increase in URTI under high temperature is interesting and worthy of further exploration especially in tropical countries, understanding the aetiology would be helpful in reducing the cases. A possible explanation suggested for this was indoor transmission in air-conditioned environments accounted for most of the transmission (Tang, Wong and Hon, 2010).

Dengue fever was observed to decrease in high rainfall and low temperature. We understand that precipitation is associated to increase incidence of dengue fever. However, our results seemed to support a recent study that claimed that heavy rainfall could disrupt breeding habitat and mosquitoes’ reproductive cycle by washing away by the rain (Benedum, Seidahmed, Eltahir, Markuzon, 2018).

These findings contribute to the evidence that weather plays a role in DF and URTI and there is no seasonal trend in URTI in Singapore. Therefore, it is important to maintain good personal hygiene at all times.

Dengue cases in Singapore in the last 14 days

With homogeneous weather conditions in Singapore, it is an interesting observation that cases of DF cluster in North East of Singapore. This could be due to several reasons:

  • Population density: Population in the central and North-Eastern parts are higher, with higher density, means more people with plants and littering which give rise to more places for mosquito to breed. However, the significant higher number of cases in the North east persisted even after normalizing by dividing the number of cases with the population in the region.
  • Geography: North-East regions have more reservoir areas that may have contributed to mosquito breeding.

This finding suggests that NEA should step up their effort in reducing the number of breeding habitat in North East of Singapore.


Limitations and Future Directions

We have identified some potential pitfalls for our project.

  1. Failure of assumption checks for the models could be due to several reasons, such as, insufficient data points and clinical conditions defined were too generalized. Future research should utilize daily number of cases to improve the model, as well as, collect more information on the virus and patients to provide a better model.
  2. Limitations in obtaining latest data. The population dataset is not up to date; thus the results have to be analyzed with caution. We also faced difficulty in obtaining data on measures that control mosquitoes, such as, frequency of fumigation, number of offenders whom were caught breeding mosquitoes. These data would be helpful in analyzing the effectiveness of these interventions.

Future research could explore creating model to predict the spread and susceptibility of dengue fever and even the effectiveness of intervention to help in preventing an outbreak.


Appendix

Appendix


Contribution Statement

Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”